Mpileup speedup #2

jkbonfield · 2016-05-26T16:50:39Z

Speed-ups to `bcftools mpileup -Ou file.bam` (i.e. the simplest mode: default options, no reference, and uncompressed BCF output).

The get_position() function now caches the read length so it doesn't have to scan through the entire cigar string for each and every base it operates on. This was O(N^2) complexity.
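A minimal sketch of the caching idea, not the code in this PR. The `toy_read_t` struct, its `cached_qlen` field, and both helper names are hypothetical stand-ins; the only thing taken from BAM is the CIGAR encoding (operation in the low 4 bits, length in the upper 28).

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical stand-in for a BAM record; NOT the htslib bam1_t layout. */
typedef struct {
    const uint32_t *cigar;   /* BAM-style CIGAR: (length << 4) | op */
    size_t n_cigar;
    int32_t cached_qlen;     /* -1 until the length has been computed */
} toy_read_t;

/* Full scan: sum the lengths of the ops that consume query bases
 * (M=0, I=1, S=4, '='=7, X=8). This is the O(N) step we want to run once. */
static int32_t scan_query_length(const uint32_t *cigar, size_t n_cigar)
{
    int32_t len = 0;
    for (size_t i = 0; i < n_cigar; i++) {
        uint32_t op = cigar[i] & 0xf;
        if (op == 0 || op == 1 || op == 4 || op == 7 || op == 8)
            len += (int32_t)(cigar[i] >> 4);
    }
    return len;
}

/* Cached accessor: scans on the first call only, then returns the stored value. */
static int32_t read_length(toy_read_t *r)
{
    if (r->cached_qlen < 0)
        r->cached_qlen = scan_query_length(r->cigar, r->n_cigar);
    return r->cached_qlen;
}
```

Calling `read_length()` once per base then costs O(1) after the first call, instead of one CIGAR scan per base.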

WARNING: it does this by shoehorning it into an unused field in BAM.
TODO: bite the bullet and break the ABI so we can put something into pileup1_t instead. I'll leave this up to you to thrash out. Probably fix the ghastly "aux" name too while at it. :-)

The speed gain is HUGE on long low-accuracy reads.

Removal of some floating point divisions in bcf_call_glfgen(). We had things like baseQ/60.0 * nqual. Given the small sizes of baseQ and nqual, we can do (baseQ*nqual) / 60 in integer arithmetic instead.
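For illustration, here are the two forms side by side. The function names are hypothetical, and the sketch assumes both inputs are small non-negative values so the product fits comfortably in an `int`.

```c
/* Floating-point form, as quoted in the description above. */
static int scale_qual_fp(int baseQ, int nqual)
{
    return (int)(baseQ / 60.0 * nqual);
}

/* Integer form: multiply first, then divide (truncating).
 * Assumes baseQ * nqual does not overflow int. */
static int scale_qual_int(int baseQ, int nqual)
{
    return (baseQ * nqual) / 60;
}
```

The integer form avoids the floating point division and the int/double conversions. The two are not guaranteed to agree for every input, since the floating-point quotient is not always exact.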

I tried changing the epos division too, but it produces different results. This is actually due to a rounding bug in the existing code: e.g. the integer math got a correct 58 while the floating point math got 57.9999999 and rounded down to 57.

If you're happy to have different results, consider changing the epos calc to `int epos = ((int64_t)pos*bca->npos)/(len+1);`. Instead, I moved the initial division earlier in the function to give the result time to be computed before we use it.
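The suggested integer form can be wrapped as a small standalone helper. The name `epos_bin` is hypothetical, and `pos`, `npos` and `len` are passed as plain arguments here rather than read from `bca`.

```c
#include <stdint.h>

/* Integer end-position bin: widen to 64 bits before multiplying so that
 * pos * npos cannot overflow, then divide (truncating). */
static int epos_bin(int pos, int npos, int len)
{
    return (int)(((int64_t)pos * npos) / (len + 1));
}
```

For example, `epos_bin(29, 100, 49)` is 2900 / 50, which is exactly 58 with no rounding involved.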

The Mann Whitney 1947 function is now precomputed for all the values potentially used in this code. See mw.h. Anything outside the bounds is computed on the fly as before, just in case I missed something.
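A rough sketch of the table-plus-fallback pattern. It is not the contents of mw.h: the bounds `MW_MAX_N` and `MW_MAX_U` are made up, the table is filled on first use rather than shipped as a static header, and the recurrence is the standard one for the distribution of U, which is assumed here to match what the code uses.

```c
#define MW_MAX_N 6                        /* hypothetical table bound */
#define MW_MAX_U (MW_MAX_N * MW_MAX_N)    /* U never exceeds n*m */

static double mw_table[MW_MAX_N + 1][MW_MAX_N + 1][MW_MAX_U + 1];
static int mw_ready = 0;

/* Recursive form, computed on the fly; assumes n >= 0 and m >= 0. */
static double mw_recursive(int n, int m, int U)
{
    if (U < 0) return 0.0;
    if (n == 0 || m == 0) return U == 0 ? 1.0 : 0.0;
    return (double)n / (n + m) * mw_recursive(n - 1, m, U - m)
         + (double)m / (n + m) * mw_recursive(n, m - 1, U);
}

/* Fill the table once for every (n, m, U) inside the bounds. */
static void mw_init(void)
{
    for (int n = 0; n <= MW_MAX_N; n++)
        for (int m = 0; m <= MW_MAX_N; m++)
            for (int U = 0; U <= MW_MAX_U; U++)
                mw_table[n][m][U] = mw_recursive(n, m, U);
    mw_ready = 1;
}

/* Table lookup inside the bounds; fall back to recursion outside them. */
static double mw_lookup(int n, int m, int U)
{
    if (!mw_ready) mw_init();
    if (n >= 0 && n <= MW_MAX_N && m >= 0 && m <= MW_MAX_N &&
        U >= 0 && U <= MW_MAX_U)
        return mw_table[n][m][U];
    return mw_recursive(n, m, U);
}
```

The fallback keeps results identical to the old behaviour for any input the table does not cover, at the old cost.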

The calc_mwu_bias() main loop now has specific code for dealing with one or both of the a[] and b[] entries being zero. This seems to be a significant speed gain given the sparsity of these arrays.
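To show the shape of the shortcut, here is a simplified U accumulation over binned counts with and without the zero special-casing. It is only loosely modelled on a Mann-Whitney U sum; the real calc_mwu_bias() does more than this and may be structured differently. Both function names are hypothetical.

```c
/* Reference loop: every bin pays for the full update. */
static double mwu_U_naive(const int *a, const int *b, int n)
{
    int nb = 0;
    double U = 0;
    for (int i = 0; i < n; i++) {
        U += a[i] * (nb + b[i] * 0.5);   /* ties in the same bin count half */
        nb += b[i];
    }
    return U;
}

/* Same result, but with cheap paths when a[i] and/or b[i] is zero. */
static double mwu_U_sparse(const int *a, const int *b, int n)
{
    int nb = 0;
    double U = 0;
    for (int i = 0; i < n; i++) {
        if (!a[i]) {
            if (!b[i]) continue;         /* both zero: nothing to do */
            nb += b[i];                  /* only b: just advance the count */
        } else if (!b[i]) {
            U += (double)a[i] * nb;      /* only a: no tie term needed */
        } else {
            U += a[i] * (nb + b[i] * 0.5);
            nb += b[i];
        }
    }
    return U;
}
```

With mostly empty bins, the common case becomes a couple of comparisons and a `continue`.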

Lots of "TRACK_EXTENTS" bits; see the #define in bam2bcf.h.
Basically this keeps track of the highest positions filled in the bca->{ref,alt}_{epos,ibq,imq} arrays. It then uses these to shortcut some of the whole-array calculations, since once it hits the runs of zeros the results don't change. The speed difference is very slight though. I leave it there for you to test, but the ifdefs show clearly how to cull it if you deem it not worth the extra complexity.
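A small sketch of the extent-tracking pattern, guarded by an ifdef in the same spirit. The `tracked_hist_t` type, `NBINS`, and the function names are all hypothetical; only the `TRACK_EXTENTS` macro name comes from the description above.

```c
#define TRACK_EXTENTS        /* comment out to fall back to full-array scans */
#define NBINS 100            /* hypothetical array size */

typedef struct {
    int counts[NBINS];
#ifdef TRACK_EXTENTS
    int extent;              /* one past the highest bin written so far */
#endif
} tracked_hist_t;

/* Caller guarantees 0 <= bin < NBINS. */
static void hist_add(tracked_hist_t *h, int bin)
{
    h->counts[bin]++;
#ifdef TRACK_EXTENTS
    if (bin + 1 > h->extent) h->extent = bin + 1;
#endif
}

/* Whole-array calculation that stops at the tracked extent when available;
 * everything past it is known to be zero and cannot change the sum. */
static long hist_weighted_sum(const tracked_hist_t *h)
{
#ifdef TRACK_EXTENTS
    int end = h->extent;
#else
    int end = NBINS;
#endif
    long sum = 0;
    for (int i = 0; i < end; i++)
        sum += (long)i * h->counts[i];
    return sum;
}
```

Removing the `#define TRACK_EXTENTS` line gives the plain full-array version with the same results.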

We no longer query the RG field over and over again for each base in the read. There's nowhere to put the value though! So, in a complete hack, I stashed it in the BAM insert size field. I don't recommend keeping this as is, but it works for now until we figure out the correct form of the new ABI/API. This is simply a demonstration that caching such things is a significant win.
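A sketch of what a cleaner home for the cached value could look like: a dedicated per-read field instead of a reused BAM field. Everything here is hypothetical (the `rg_read_t` struct, the toy read-group list, and the function names); it only shows the lookup happening once per read rather than once per base.

```c
#include <string.h>

/* Hypothetical per-read record with a dedicated cache slot. */
typedef struct {
    const char *rg;      /* read group ID string */
    int sample_id;       /* valid only when cached != 0 */
    int cached;
} rg_read_t;

static int n_lookups = 0;                         /* counts the expensive calls */
static const char *known_rg[] = { "rgA", "rgB" }; /* toy read-group list */

/* Stand-in for the expensive string lookup; returns -1 if unknown. */
static int lookup_sample_id(const char *rg)
{
    n_lookups++;
    for (int i = 0; i < 2; i++)
        if (strcmp(known_rg[i], rg) == 0) return i;
    return -1;
}

/* Cached accessor: one lookup per read, however many bases are processed. */
static int read_sample_id(rg_read_t *r)
{
    if (!r->cached) {
        r->sample_id = lookup_sample_id(r->rg);
        r->cached = 1;
    }
    return r->sample_id;
}
```

A separate `cached` flag is used so that a legitimate "not found" result of -1 is also remembered.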

Other potential improvements:

The main loop still runs column by column and, within each column, sequence by sequence. We have many cases where the same pileup array occurs for many columns, but we always throw it all away and recompute. All we really need to do is a memmove left by 1 byte.

Similarly when finding the sample groupings, we do this over and over again despite the fact they're not changing.

Fixing this is a larger restructuring that I didn't want to do myself. It is left as an exercise to the reader, but there are potentially huge speed gains to be had here.

On an Illumina data set (10 million reads), the timings went from 21m15s to 12m33s (approx 70% faster). On a PacBio data set the time dropped from 5m17s to 0m13s!

- rewritten for greater experimental flexibility
- the AA peak can be included in the fit (the -i option)
- the minimum fraction of aberrant cells as a command line parameter (-m)
- control the minimum cn4 bump size (-b)

Minimal changes to files copied from samtools in order to compile in bcftools.

* use regidx from htslib rather than bedidx from samtools
* remove sam_opts calls as sam_opts.h not copied from samtools

Todo:

* copy over relevant functionality from sam_opts.h
* remove text based mpileup output
* update options and defaults
* bring over `--gvcf` and other changes from @pd3 fork

* deprecate `-g -v -u` options (still functional, but with warning)
* exit with message to use `samtools mpileup` if `-s/--output-MQ` used
* `-O` option was `--output-BP` and is now `--output-type` for consistency with other bcftools commands. If `[buzv]` not given as an option will warn

These catches for old text output options are probably not necessary as users may not expect text output from `bcftools mpileup`.

* mpileup.1.out, mpileup.2.out, mpileup.3.out and mpileup.4.out are from samtools with mpileup.1.out and mpileup.3.out converted from the text output
* mpileup.5.out a new test with the newer AD, etc annotations
* sam/bam/cram test files all stored. Perhaps there is some way to store one version and convert within the test ala the vcf-miniview in samtools?

This commit brings over the `--gvcf` functionality from @pd3's branch, consisting of relevant bits from cf3219c and ee8210d.

Reference-only blocks will be merged into gVCF blocks when the minimum per-sample depth falls in the intervals defined by the argument to the `--gvcf` option. Documentation added to explain the merging and a test added.

Pulling over of cf5c354, adding in the `-S,--samples-file` option and exiting if no samples are read from the file or list.

TODO: add exclude logic with `^` prefix as in other bcftools commands

Removed `config.h` from `sample.c` as a leftover; this is in samtools, but not bcftools at the moment.

This is meant as a temporary change while we extend the regidx API, allowing bcftools code to use these changes before they appear in some form in htslib. This commit does not add new features; it just copies over `regidx.[ch]` and rejiggers the linking to use these local bcftools copies. The `*_c` are removed due to relying on `hts_internal.h` (see fc9aeb6f77668afed412119701c5c58b0fca8091).

* added functions to loop over all regions
* lazy index build in case random access is not required
* support for chromosome names only, beg-end coordinates not mandatory
* cap the maximum coordinate at 2147483647, hts_itr does not support larger
* tab and reg parsers will throw on finding a `0` to catch user error of using 0-based rather than 1-based coords

* `-r/--region` replaced by `-r/--regions` which will accept a comma separated list of regions as in other bcftools commands. `--region` still accepted
* `-R/--regions-file` option added to read regions from a file

This commit lifts over work originally done in cfd7cf9.

Note: when more than one region is given, all indices are stored in memory, which can be a problem when running on many bams. An alternative would be to cache pre-filled `hts_itr`'s for each region.

Resolves samtools#369

…ools commands

The point of `--no-version` is to remove invocation-specific metadata in the header lines for pipeline systems that are tracking this separately. We are outputting the `##reference` line though in mpileup. Could drop this as well when `--no-version` used. Seems silly to add a separate option.

* prefix with ^ to negate the selection
* assign/rename samples by providing second field:

      RG_ID_1 SAMPLE_A
      RG_ID_2 SAMPLE_A
      RG_ID_3 SAMPLE_B

* on read group name conflict give the alignment file, asterisk for all reads in the file:

      RG_ID_1 FILE_1.bam SAMPLE_A
      RG_ID_2 FILE_2.bam SAMPLE_A
      *       FILE_3.bam SAMPLE_C

Resolves 4th item in samtools#414 (comment) and samtools/samtools#324.

Our first foray into exploiting this is to cache the bam_smpl_get_sample_id return value. We compute this once in the constructor (the first time we see a new bam1_t) instead of for every pileup location in group_smpl.

TO DO: group_smpl itself could now become distributed perhaps. Rather than an N*M loop clustering all sample IDs together, each new bam could be added to sample struct on first appearance and removed when it goes out of sight.

The slight caveat preventing this from being implemented immediately is that the constructor/destructors are called for every BAM overlapping the region rather than every filtered base that ultimately ends up in a pileup. Indeed sometimes we get constructors for reads entirely filtered out.

pd3 and others added 30 commits June 13, 2015 06:17

Hidden switch for unscaled output in polysomy

f07f0fb

Tuning the parameters and fitting procedure, all RA peaks included fo…

62764c0

…r CN4

polysomy: CN4 side peaks symmetry check now relative rather than abso…

983bea4

…lute; hidden --force-cn option for debugging

minor wording and whitespace changes to polysomy.c

57edcab

show polysomy command in help message

9480a54

This was hidden as it was/is experimental, but with the licensing additions, users are explicitly requesting the command by compiling with USE_GPL=1, so we might as well display in the help message. Closes samtools#280

update copyright dates for peakfit

68b06b7

Merge pull request samtools#286 from samtools/feature/polysomy

96e7858

updates to polysomy command for publication

Implements "bcftools view -G removes all format header lines"

baa24a8

Resolves samtools#248

The HMM structure becomes opaque for future flexibility

8fdd498

RoH bugfix: Transition probabilities must scale to 1

f506244

New bcftools roh --AF-dflt option; changed -a,-H defaults; fixed a bu…

3ba5f38

…g in -e -

Fix polysomy dependencies [minor]

aea4cf9

Merge pull request samtools#287 from samtools/feature/roh

626527c

updates to roh command for publication

plot-vcfstats: Fix edge case in tstv_by_qual plot, see samtools#229

da26c06

concat: new --compact-PS switch

4eec7d5

concat -d option to remove duplicates based on one of the collapse sr…

4abc0f2

…ings: <snps|indels|both|all|none>

concat -l: Print warning and proceed if GT not present

566df3c

bcf_hdr_combine replaced by bcf_hdr_merge

0f15ea8

bcf_hdr_combine deprecated and replaced by bcf_hdr_merge in htslib (2bb9370f5a24938d8a2dc56f404e584661bf413f). Fixes samtools#208

More control duplicate lines removal with norm (REF still not treated…

7e7a7db

… well)

Output missing FMT fields as "." rather than negative integer when qu…

4bfe242

…erying FMT vectors, such as PL{0}

Updated cnv command:

ddc0bef

- CNx state removed
- automatic optimization of parameters to increase sensitivity when only a fraction of cells is aberrant (at the cost of decreased specificity)
- LRR smoothing

cnv -O: Increase robustness by smoothing HMM lks before updating the …

d2b012f

…mean

cnv: Update to the modified HMM API

2b50604

cnv: Prior on CN2; changed the default -P to be more strict (0.5)

d3f9fe4

Further support for case-insensitive REF,ALT allele merging

733f257

See also 1f81d25 and samtools/htslib@3fcf7c9. Fixes samtools#285. Potentially fixes samtools#284.

more informative error message: annotate supports FMT fields only fro…

cecff77

…m a VCF Closes samtools#294

merge: More informative error message when FILTER is undefined

ae5663d

Resolved assert when comparing long GT strings

58a1cf3

plugin: new --version switch

0cf6efb

pd3 and others added 27 commits June 28, 2016 18:03

plugin: new plugin to detect sites with statistically significant dif…

c78a6f5

…ferences in FMT/AD

ad-bias: do not skip first line of samples file; added test

dd11889

Merge branch 'feature/ad-bias' into develop

da5e8a0

norm: avoid regression when ALT is a spanning deletion *

e43ca8c

* add tests for samtools#452 and samtools#439

Resolves samtools#452

copy over mpileup files from samtools

fe43a83

Not functional yet. Just copying over the files.

* bam_plcmd.c for the mpileup command
* bam2bcf.[ch] bam2bcf_indel.c for the VCF/BCF creation
* sample.[ch] for RG:SM handling

rename bam_plcmd.c -> mpileup.c

e46a2cf

bring over mpileup documentation

b20d1e7

mpileup: remove printw() which was only used in the text output

6546afd

adds to @4e7c8fb86349761fed1b290357dbc792222ecdcb

remove deprecated options

acb1b9b

* remove deprecated `-g`, `-v`, `-u`, `-D`, `-V`, `-S`
* remove `-R` short option to make way for `--regions-file` option later

mpileup: switch --output-tags option to --annotate

d2f3068

Switch the `-t/--output-tags` option to `-a/--annotate` to make room for the `-t/--targets` option. Available annotations are now listed on request with `-a ?` rather than cluttering up the help output.

mpileup: in read_file_list(), check for URL syntax or extant files

ceb0a20

see samtools/samtools@91283dd

mpileup: enable -g short --gvcf option that was missed

c552060

mpileup: exclude samples when prefixed with ^

2607f11

mpileup: -l replaced with -t,-T to match other bcftools commands

51e31af

rename sample.* to bam_sample.*

7f4929c

Speed up of mpileup funcs.

c333e8b

Mann Whitney test now uses a precomputed table instead of continually calculating the same values many times over. calc_mwu_bias now has short-cuts to compute the result faster when one or both of a[] and b[] hold zero values.

pd3 force-pushed the develop branch from cabda58 to c137015 November 16, 2016 07:31

pd3 force-pushed the develop branch from fce297d to 0668ecc July 23, 2018 12:05

pd3 pushed a commit that referenced this pull request Feb 12, 2021

Merge pull request #2 from samtools/develop

872d506

Merge upstream changes
